CopulaHiC is R package for differential analysis of Hi-C data. It takes advantage of significant correlations of main diagonals between different Hi-C data sets (cell lines, experiments, etc.) - usually first 200 to 300 diagonals from main diagonal are considered. CopulaHiC uses copulas to model these dependancies and then quantifies deviations from the model in a probabilistic way.
This vignette contains description of our method: - basic introduction on Hi-C data and its related biases, - issues with Hi-C differential analysis, - joint normalization of Hi-C data via copula models, - detection of differential significantly interacting regions.
If you want to spare yourself extra reading and would like to jump straight to analysis part see quick start chapter. It contains step by step introduction to Hi-C differential analysis with our package on example Hi-C data set.
Load package:
library("CopulaHiC")To compare Hi-C experiments you need 2 files with Hi-C matrices. In this tutorial we will use Hi-C data sets of human IMR90 and human MSC cells lines from GSE63525 and GSE52457 studies respectively available as sample data in CopulaHiC package. For simplicity we will only use chromosome 18. First read data:
mtx.fname.imr90 <- system.file("extdata", "IMR90-MboI-1_40kb-raw.npz",
package = "CopulaHiC", mustWork = TRUE)
mtx.fname.msc <- system.file("extdata", "MSC-HindIII-1_40kb-raw.npz",
package = "CopulaHiC", mustWork = TRUE)
hic.comparator <- HiCcomparator(mtx.fname.imr90, mtx.fname.msc, mtx.names = c("18"), do.pca = TRUE)Illustrate Hi-C contact maps with A/B compartments:
dense.imr90 <- sparse2dense(hic.comparator$maps1[["18"]][c("i","j","val")],
N = hic.comparator$maps.dims[["18"]][1,"n.rows"])
dense.msc <- sparse2dense(hic.comparator$maps2[["18"]][c("i","j","val")],
N = hic.comparator$maps.dims[["18"]][2,"n.rows"])
# there are some troubles with adjusting margins, for a momement it must be done manually
layout(matrix(c(1,2,3,4), ncol = 2, byrow = TRUE), widths = c(2,2), heights = c(2,1))
par(mar = c(2,1,0.5,1))
plot_contact_map(dense.imr90)
par(mar = c(1,1,1,1))
plot_contact_map(dense.msc)
plot_pc_vector(hic.comparator$pc1.maps1[["18"]])
plot_pc_vector(hic.comparator$pc1.maps2[["18"]])IMR90 and MSC contact maps with A/B compartments of human chromosome 18.
Determine TADs for both Hi-C maps and illustrate them:
tads.imr90 <- map2tads(dense.imr90, resolution = 40*1000, window.bp = 400*1000, delta.bp = 80*1000)
tads.msc <- map2tads(dense.msc, resolution = 40*1000, window.bp = 400*1000, delta.bp = 80*1000)
plot_with_inset(list(dense.imr90), c(600,900), c(600,900), args.regions = list(tads.imr90))
plot_with_inset(list(dense.msc), c(600,900), c(600,900), args.regions = list(tads.msc))IMR90 contact map of human chromosome 18 with TADs.
MSC contact map of human chromosome 18 with TADs.
Compare coverage and decays for IMR90 and MSC, for decays use log10 scales for both X and Y axis:
library("ggplot2")
coverage <- dominating_signal(hic.comparator, which.signal = "coverage")
ggplot(coverage, aes(x = i, y = sum.contacts, color = dataset)) +
geom_point(size = 0.5) +
geom_smooth(alpha = 0.5) +
facet_wrap(~ name, ncol = 1, scales = "free") +
theme(legend.position = "bottom")
decay <- dominating_signal(hic.comparator, which.signal = "decay")
ggplot(decay[decay$diagonal != 0,], aes(x = diagonal, y = mean.contacts, color = dataset)) +
geom_point(size = 0.5) +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~ name, ncol = 1, scales = "free") +
theme(legend.position = "bottom")Coverage comparison between IMR90 and MSC Hi-C data of human chromosome 18.
Decay comparison between IMR90 and MSC Hi-C data of human chromosome 18.
Calculate fold change and differential maps IMR90 / MSC and IMR90 - MSC respectively. Illustrate results with zoomin of 600 - 900 bins region. By default when plotting fold change maps log10 scale is used. When plotting differential map one should indicate sqrt transformation of data for better visibility (see examples below):
merged <- merge(hic.comparator)
merged.18 <- merged[["18"]]
merged.18$fc <- merged.18$val.x / merged.18$val.y
merged.18$difference <- merged.18$val.x - merged.18$val.y
dense.fc <- sparse2dense(merged.18[c("i","j","fc")], N = hic.comparator$maps.dims[["18"]][1,"n.rows"])
dense.diff <- sparse2dense(merged.18[c("i","j","difference")],
N = hic.comparator$maps.dims[["18"]][1,"n.rows"])
plot_with_inset(list(dense.fc, fc = TRUE, colors.pal = c("blue","white","red")), c(600,900), c(600,900), which.map = "contact.map")
plot_with_inset(list(dense.diff, breaks = 100, sqrt.transform = TRUE), c(600,900), c(600,900), which.map = "diff.map")Fold Change map of IMR90 / MSC of human chromosome 18.
Differential map of IMR90 - MSC of human chromosome 18.
Check correlations between corresponding diagonals of IMR90 and MSC:
library("reshape2")
decay.cors <- decay_correlation(hic.comparator)
# wide to long
decay.cors.long <- reshape2::melt(decay.cors[c("name","diagonal","pcc","rho","tau")],
id.vars = c("name","diagonal"), variable.name = "correlation",
value.name = "coefficient")
# remove 0 diagonal (as it is non informative anyways) and illustrate results
ggplot(decay.cors.long[decay.cors.long$diagonal != 0,],
aes(x = diagonal, y = coefficient, color = correlation)) +
geom_point(size = 0.7) +
scale_x_continuous(breaks = c(1,seq(0, max(decay.cors.long$diagonal), by = 250)[-1])) +
facet_wrap(~ name, ncol = 1, scales = "free") +
theme(legend.position = "bottom")
dc <- decay.cors[c("name","diagonal","pearson.pval","spearman.pval","kendall.pval")]
colnames(dc) <- c("name","diagonal","pearson","spearman","kendall")
decay.sig.long <- reshape2::melt(dc, id.vars = c("name","diagonal"),
variable.name = "correlation", value.name = "pval")
decay.sig.long$neg.log.pval <- -log10(decay.sig.long$pval)
ggplot(decay.sig.long[decay.sig.long$diagonal != 0,],
aes(x = diagonal, y = neg.log.pval, color = correlation)) +
geom_point(size = 0.7) +
geom_hline(yintercept = -log10(0.01)) +
annotate("text", 1800, -log10(0.01), vjust = -1, label = "pval = 0.01") +
scale_x_continuous(breaks = c(1,seq(0, max(decay.sig.long$diagonal), by = 250)[-1])) +
ylab("-log10(pval)") +
facet_wrap(~ name, ncol = 1, scales = "free") +
theme(legend.position = "bottom")Correlations between diagonals of IMR90 and MSC Hi-C data of human chromosome 18.
Significances of correlations between diagonals of IMR90 and MSC Hi-C data of human chromosome 18.
Construct Hi-C copula models (if you are running this on a cluster you can increase number of cores - it will speed up model construction significantly):
hic.copula <- HiCcopula(hic.comparator, diagonals = 1:240, n.cores = 1)Plot marginals gamma fit for diagonal 10 and copula fit for diagonals, say 1, 10, 50 and 200:
model.18.1 <- hic.copula$model[["18"]][["1"]]
model.18.10 <- hic.copula$model[["18"]][["10"]]
model.18.50 <- hic.copula$model[["18"]][["50"]]
model.18.200 <- hic.copula$model[["18"]][["200"]]
library("fitdistrplus")
library("VineCopula")
plot(model.18.10$marginal.x)
plot(model.18.10$marginal.y)
c1 <- plot(model.18.1$bf.copula, main = "diagonal 1")
c2 <- plot(model.18.10$bf.copula, main = "diagonal 10")
c3 <- plot(model.18.50$bf.copula, main = "diagonal 50")
c4 <- plot(model.18.200$bf.copula, main = "diagonal 200")
grid.arrange(c1,c2,c3,c4, ncol=2)Gamma fit of marginal distribution X (IMR90) of diagonal 10.
Gamma fit of marginal distribution Y (MSC) of diagonal 10.
Copula fit of U (IMR90) vs V (MSC) for diagonals 1,10,50 and 200.
When background model is constructed differential map can be computed (and visualized):
diffmap <- hicdiff(hic.copula)
diffmap.dense <- hicdiff2mtx(diffmap, hic.copula$maps.dims)
# plot diffmap
diffmap.18.dense <- diffmap.dense[["18"]]
plot_with_inset(list(diffmap.18.dense, breaks = 100), c(1300,1700), c(1300,1700), which.map = "diff.map")Differential/Significance map of human chromosome 18 IMR90 vs MSC cell lines.
Finally groups of cells with significant depletion/enrichment can be identified (with some precision) as rectangle like regions containing connected components:
lrdi <- differential_interactions(hic.copula)
# get rectangle like regions
regions <- lrdi[["18"]]$interacting.regions
# plot them --> only those which contain at least 5 cells inside connected component
plot_with_inset(list(diffmap.18.dense, breaks = 100), c(1300,1700), c(1300,1700), which.map = "diff.map",
args.regions = list(regions[regions$n.cells >= 5, 2:6], pal.colors = c("blue","red")))
plot_with_inset(list(diffmap.18.dense, breaks = 100), c(900,1200), c(900,1200), which.map = "diff.map",
args.regions = list(regions[regions$n.cells >= 5, 2:6], pal.colors = c("blue","red")))Detection of differential long range interactions in IMR90 vs MSC map. Bilinear model fit to depleted (negative) significances in order to determine threshold for long range interactions.
Detection of differential long range interactions in IMR90 vs MSC map. Bilinear model fit to enriched (positive) significances in order to determine threshold for long range interactions.
Differential/Significance map of human chromosome 18 IMR90 vs MSC cell lines with long range differential interactions (zoomin 1).
Differential/Significance map of human chromosome 18 IMR90 vs MSC cell lines with long range differential interactions (zoomin 2).
Contact map of human chromosome 18 IMR90 cell line in 40kb resolution.
Hi-C is a complex protocol with many sources of bias (Yaffe and Tanay 2011), which make analysis incorporating this type of data challenging. Therefore almost any biological study utilizing Hi-C data performed without proper normalization will lead to nonmeaningfull results.
Here we consider a problem of comparing 2 Hi-C datasets. Depending on experimental design it may be for example different cell lines or different experimental conditions. This sort of studies apper relatively often in literature indicating high importance of this problem (Le Dily et al. 2014; Dixon et al. 2015; Lun and Smyth 2015). We present key issues arising in this type of assays and suggest a model, which have potential to improve the reliability of Hi-C differential analysis.
Coverages (sum of contacts per region) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines.
Decays (mean of contacts per diagonal) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines.
Fold Change map of IMR90 / MSC of human chromosome 19.
Coverages (sum of contacts per region) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines after Iterative Correction.
Decays (mean of contacts per diagonal) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines after Iterative Correction.
Fold Change map of IMR90 / MSC of human chromosome 19 with both maps iteratively corrected.
A possible workaround could be joint normalization method, which tries to model similarities in coverage and contact decay between both Hi-C experiments simultaneously and then seeks for deviations from the model.
Correlations between corresponding diagonals (i.e. 1 vs 1, 2 vs 2, etc) of IMR90 replicate 1 vs replicate 2 Hi-C contact map of human chromosome 19.
Significances of the above correlations.
Correlations between corresponding diagonals (i.e. 1 vs 1, 2 vs 2, etc) of IMR90 vs MSC Hi-C contact map of human chromosome 19.
Significances of diagonals correlations.
The trends shown above are representative for many other cell types and chromosomes analyzed by us.
The above observation prompted us to model dependence between Hi-C contact maps as global “uninteresting” signal and then try to find deviations from the model as local, relevant differences. The common way to model dependencies between datasets is with the use of copulas. Copulas are multivariate distributions with uniform marginals. They allow to separate modelling of marginal distributions from modelling of dependancies between random variables. Copulas have a long history in data modelling, for more details on the topic see for example (embrechts2001modelling; Trivedi, Zimmer, and others 2007). A great advantage of using copulas is that there exist rich and well documented R packages dedicated to copula modelling, which makes model selection quick and straightforward process (Hofert et al. 2017; Jun Yan 2007; Ivan Kojadinovic and Jun Yan 2010; Marius Hofert and Martin Mächler 2011; Schepsmeier et al. 2018).
Our package is build on top of VineCopula library, which can easily handle model selection for bivariate distributions. The pipeline can be summarized with below panels.Input, a pair of Hi-C contact maps to be compared.
Model selection and p-value calculation.